Abstract. Entity resolution (ER), an important and common data cleaning problem,
is about detecting duplicate data representations of the same external entities, and
merging them into single representations. Relatively recently, declarative rules
called matching dependencies (MDs) have been proposed for specifying similarity
conditions under which attribute values in database records are merged. In this work
we show the process and the benefits of integrating three components of ER:
(a) classifiers for duplicate/non-duplicate record pairs built using machine learning
(ML) techniques; (b) MDs for supporting both the blocking phase of ML and the merge
itself; and (c) the use of the declarative language LogiQL (an extended form of
Datalog supported by the LogicBlox platform) for data processing, and the
specification and enforcement of MDs.

1 Introduction

Entity resolution (ER) is a common and difficult problem in data cleaning that has to
do with handling unintended multiple representations in a database of the same external
objects. Multiple representations lead to uncertainty in data and the problem of managing
it. Cleaning the database reduces uncertainty. In more precise terms, ER is about the
identification and fusion of database records (think of rows or tuples in tables) that
represent the same real-world entity. As a consequence, ER usually goes through
two main consecutive phases: (a) detecting duplicates, and (b) merging them into single
representations.

For duplicate detection, one must first analyze multiple pairs of records, comparing
the two records in them, and discriminating between pairs of duplicate records and
pairs of non-duplicate records. This classification problem is approached with machine
learning (ML) methods, to learn from previously known or already made classifications
(a training set for supervised learning), building a classification model (a classifier) for
deciding about other record pairs.

In principle, in ER every two records (forming a pair) have to be compared, and then
classified. Most of the work on applying ML to ER operates at the record level,
and only some of the attributes, or their features, i.e. numerical values associated to
them, may be involved in duplicate detection. The choice of relevant sets of attributes
and features is application dependent.

ER may be a task of quadratic complexity since it requires comparing every two
records. To reduce the large number of two-record comparisons, blocking techniques are
used. Commonly, a single record attribute, or a combination of attributes, the
so-called blocking key, is used to split the database records into blocks. Next, under the
assumption that any two records in different blocks are unlikely to be duplicates, only
pairs of records in the same block are compared for duplicate detection.
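
The effect of a blocking key on the number of comparisons can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records and the key (the first three letters of the lowercased name) are hypothetical, and are not the keys used in ERBlox.

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, key_fn):
    """Group record ids by the value of the blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key_fn(rec)].append(rid)
    return blocks

def candidate_pairs(blocks):
    """Only records that share a block are compared."""
    pairs = set()
    for rids in blocks.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs

# Toy records (hypothetical data, for illustration only).
records = {
    1: {"name": "Matthias Roeckl"},
    2: {"name": "Matthias Roeckl"},
    3: {"name": "Olivier de Sardan"},
    4: {"name": "Jean-Pierre Olivier de Sardan"},
}

# Hypothetical blocking key: first three letters of the lowercased name.
blocks = block_records(records, lambda rec: rec["name"].lower()[:3])
pairs = candidate_pairs(blocks)
all_pairs = list(combinations(sorted(records), 2))
```

With this key only one of the six possible pairs is compared. Records 3 and 4, which plausibly refer to the same person, land in different blocks, which is the kind of missed duplicate discussed next.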

Although blocking will discard many record pairs that are obvious non-duplicates, 
some true duplicate pairs might be missed (by putting them in different blocks), due 
to errors or typographical variations in attribute values. More interestingly, similarity 
between blocking keys alone may fail to capture the relationships that naturally hold in 
the data and could be used for blocking. Thus, entity blocking based only on blocking 
key similarities may cause low recall. This is a major drawback of traditional blocking 
techniques. 

In this work we consider different and coexisting entities. For each of them, there
is a collection of records. Records for different entities may be related via attributes in
common or referential constraints. Blocking can be performed on each of the participating
entities, and the way records for an entity are placed in blocks may influence
the way the records for another entity are assigned to blocks. This is called "collective
blocking". Semantic information, in addition to that provided by blocking keys for
single entities, can be used to state relationships between different entities and their
corresponding similarity criteria. So, blocking decision making forms a collective and
intertwined process involving several entities. In the end, the records for each individual
entity will be placed in blocks associated to that entity.

Example 1. Consider two entities, Author and Paper. For each of them, there is a set
of records (for all practical purposes, think of database tuples in a single table). For
Author we have records of the form $a = \langle name, \ldots, affiliation, \ldots, paper\ title, \ldots \rangle$,
with {name, affiliation} the blocking key; and for Paper, records of the form
$p = \langle title, \ldots, author\ name, \ldots \rangle$, with title the blocking key. We want to group Author
and Paper records at the same time, in an entwined process. We block together two
Author entities on the basis of the similarities of authors' names and affiliations.

Assume that Author entities $a_1, a_2$ have similar names, but their affiliations are not
similar. So, the two records would not be put in the same block. However, $a_1, a_2$ are
authors of papers (in Paper records) $p_1, p_2$, resp., which have been put in the same block
(of papers) on the basis of similarities of paper titles. In this case, additional semantic
knowledge might specify that if two papers are in the same block, then corresponding
Author records that have similar author names should be put in the same block too.
Then, $a_1$ and $a_2$ would end up in the same block.

In this example, we are blocking Author and Paper entities, separately, but collectively
and in interaction. ■

Collective blocking is based on blocking keys and the enforcement of semantic information
about the relational closeness of entities Author and Paper, which is captured by a
set of matching dependencies (MDs). So, we propose "MD-based collective blocking"
(more on MDs right below).

After records are divided in blocks, the proper duplicate detection process starts,
and is carried out by comparing every two records in a block, and classifying the pair
as "duplicates" or "non-duplicates" using the trained ML model at hand. In the end,
records in duplicate pairs are considered to represent the same external entity, and have
to be merged into a single representation, i.e. into a single record. This second phase is
also application dependent. MDs were originally proposed to support this task.

Matching dependencies are declarative logical rules that tell us under what conditions
of similarity between attribute values, any two records must have certain attribute
values merged, i.e. made identical. For example, the MD

$Dept_B[dept] \approx Dept_B[dept] \rightarrow Dept_B[city] \doteq Dept_B[city]$    (1)

tells us that for any two records for entity (or relation or table) $Dept_B$ that have similar
values for attribute dept, their values for attribute city should be matched, i.e.
made the same.

MDs as originally introduced do not specify how to merge values. In [6, 7], MDs were
extended with matching functions (MFs). For a data domain, an MF specifies how to
assign a value in common to two values. We adopt MDs with MFs in this work. In the
end, the enforcement of MDs with MFs should produce a duplicate-free instance (cf.
Section 2.1 for more details).

MDs have to be specified in a declarative manner, and at some point enforced, by
producing changes on the data. For this purpose, we use the LogicBlox platform, a data
management system developed by the LogicBlox^1 company, that is centered around
its declarative language, LogiQL. LogiQL supports relational data management and,
among several other features, an extended form of Datalog with stratified negation.
This language is expressive enough for the kind of MDs considered in this work.^2

In this paper, we describe our ERBlox system. It is built on top of the LogicBlox platform,
and implements entity resolution (ER) applying LogiQL, ML techniques, and
the specification and enforcement of MDs. More specifically, ERBlox has three main
components: (a) MD-based collective blocking, (b) ML-based duplicate detection, and
(c) MD-based merging. The sets of MDs are fixed and different for the first and last
components. In both cases, the sets of MDs are interaction-free, which results, for
each entity, in a unique set of blocks, and eventually in a single, duplicate-free
instance. We use LogiQL to declaratively implement the two MD-based components
of ERBlox.

The blocking phase uses MDs to specify the blocking strategy. They express conditions
in terms of blocking key similarities and also relational closeness (the semantic
knowledge) to assign two records to the same block (by making the block identifiers
identical). Then, under MD-based collective blocking different records of possibly several
related entities are simultaneously assigned to blocks through the enforcement of MDs
(cf. Section 5 for details).

On the ML side, the problem is about detecting pairs of duplicate records. The ML
algorithm is trained using record pairs known to be duplicates or non-duplicates. We
independently used three established classification algorithms: support vector machines
(SVMs) [25], k-nearest neighbor (K-NN), and non-parametric Bayes classifier
(NBC). We used the Ismion^3 implementations of them due to the in-house expertise
at LogicBlox. Since the emphasis of this work is on the use of LogiQL and MDs, we
will refer only to our use of SVMs.

1 www.logicblox.com

2 For arbitrary sets of MDs, we need higher expressive power, such as that provided by
answer set programming.

3 http://www.ismion.com

We experimented with our ERBlox system using as dataset a snapshot of Microsoft
Academic Search (MAS)^4 (as of January 2013) including 250K authors and 2.5M papers.
It contains a training set. The experimental results show that our system improves
ER accuracy over traditional blocking techniques, which we will call standard
blocking, where just blocking-key similarities are used. Actually, MD-based collective
blocking leads to higher precision and recall on the given datasets.

This paper is structured as follows. Section 2 introduces background on matching
dependencies and their semantics, and SVMs. A general overview of the ERBlox system
is presented in Section 3. The specific components of ERBlox are discussed in the
sections that follow, and are followed by experimental results and conclusions.

2 Preliminaries 

2.1 Matching dependencies 

We consider an application-dependent relational schema $\mathcal{R}$, with a data domain $U$. For
an attribute $A$, $Dom_A$ is its finite domain. We assume predicates do not share attributes,
but different attributes may share a domain. An instance $D$ for $\mathcal{R}$ is a finite set of ground
atoms of the form $R(c_1, \ldots, c_n)$, with $R \in \mathcal{R}$, $c_i \in U$.

We assume that each entity is represented by a relational predicate, and the tuples or
rows in its extension correspond to records for the entity. As in earlier work on MDs, we
assume records have unique, fixed, global identifiers, rids, which are positive integers.
This allows us to trace changes of attribute values in records. Record ids are placed in an
extra attribute for $R \in \mathcal{R}$ that acts as a key. Then, records take the form $R(r, \bar{r})$, with $r$ the
rid, and $\bar{r} = (c_1, \ldots, c_n)$. Sometimes we leave rids implicit, and sometimes we use them to
denote whole records: if $r$ is a record identifier in instance $D$, $\bar{r}$ denotes the record in
$D$ identified by $r$. Similarly, if $\mathcal{A}$ is a sublist of the attributes of predicate $R$, then $r[\mathcal{A}]$
denotes the restriction of $\bar{r}$ to $\mathcal{A}$.

MDs are formulas of the form: $R_1[\bar{X}_1] \approx R_2[\bar{X}_2] \rightarrow R_1[\bar{Y}_1] \doteq R_2[\bar{Y}_2]$.

Here, $R_1, R_2 \in \mathcal{R}$ (and may be the same); and $\bar{X}_1, \bar{X}_2$ are lists of attribute names
of the same length that are pairwise comparable, that is, corresponding attributes in
$\bar{X}_1$ and $\bar{X}_2$, and also in $\bar{Y}_1$, $\bar{Y}_2$, share the same domain.^5 The MD says that, for every
pair of tuples (one in relation $R_1$, the other in relation $R_2$) where the LHS is true, the
attribute values in them on the RHS have to be made identical. Symbol $\approx$ denotes generic,
reflexive, symmetric, and application/domain dependent similarity relations on shared
attribute domains.

A dynamic, chase-based semantics for MDs with matching functions (MFs) was
previously introduced. Given an initial instance $D$, the set $\Sigma$ of MDs is iteratively enforced
until they cannot be applied any further, at which point a resolved instance has been
produced. In order to enforce (the RHSs of) MDs, there are binary matching functions
(MFs) $m_A : Dom_A \times Dom_A \rightarrow Dom_A$; and $m_A(a, a')$ is used to replace two values
$a, a' \in Dom_A$ that have to be made identical. MFs are idempotent, commutative, and
associative, and then induce a partial-order structure $\langle Dom_A, \preceq_A \rangle$, with:
$a \preceq_A a' :\Leftrightarrow m_A(a, a') = a'$. It always holds: $a, a' \preceq_A m_A(a, a')$. In this work, MFs are
treated as built-in relations.

4 http://academic.research.microsoft.com. For comparison, we also tested our system with data
from DBLP and Cora.

5 A more precise notation for the MD would be:
$\bigwedge_j R_1[x_1^j] \approx_j R_2[x_2^j] \rightarrow \bigwedge_k R_1[y_1^k] \doteq R_2[y_2^k]$.

There may be several resolved instances for $D$ and $\Sigma$. However, when (a) MFs are
similarity-preserving (i.e., $a \approx a'$ implies $a \approx m_A(a', a'')$); or (b) $\Sigma$ is interaction-free
(i.e., each attribute may appear either in the RHSs or in the LHSs of MDs in $\Sigma$, but not
in both), there is a unique resolved instance that is computable in polynomial time in $|D|$.
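
The algebraic conditions on matching functions, and the partial order they induce, can be checked on a toy example. The sketch below assumes a hypothetical MF for a name attribute that keeps the longer of two strings (ties broken lexicographically); this particular function is an assumption made for illustration and is not taken from ERBlox.

```python
from itertools import product

def m_name(a: str, b: str) -> str:
    """Hypothetical matching function: keep the longer string,
    breaking ties lexicographically (a maximum in a total order)."""
    return max(a, b, key=lambda s: (len(s), s))

def precedes(a: str, b: str) -> bool:
    """Induced order: a precedes b exactly when m_name(a, b) == b."""
    return m_name(a, b) == b

values = ["J. Smith", "John Smith", "John A. Smith"]

# Idempotency, commutativity, associativity, and the property that
# both arguments precede the merged value.
laws_hold = all(
    m_name(a, a) == a
    and m_name(a, b) == m_name(b, a)
    and m_name(a, m_name(b, c)) == m_name(m_name(a, b), c)
    and precedes(a, m_name(a, b))
    and precedes(b, m_name(a, b))
    for a, b, c in product(values, repeat=3)
)
```

Any maximum taken with respect to a total order satisfies these laws, which is why a function of this shape is a convenient stand-in.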

2.2 Support vector machines 

The SVMs technique [25] is a form of kernel-based learning. SVMs can be used for
classifying vectors in an inner-product vector space $V$ over $\mathbb{R}$. Vectors are classified
in two classes, with a label in $\{0, 1\}$. The algorithm learns from a training set, say
$\{(e_1, f(e_1)), (e_2, f(e_2)), (e_3, f(e_3)), \ldots, (e_n, f(e_n))\}$. Here, $e_i \in V$, and for the
feature (function) $f$: $f(e_i) \in \{0, 1\}$.

SVMs find an optimal hyperplane, $\mathcal{H}$, in $V$ that separates the two classes in which the
training vectors are classified. Hyperplane $\mathcal{H}$ has an equation of the form $w \bullet x + b = 0$,
where $\bullet$ denotes the inner product, $x$ is a vector variable, $w$ is a weight vector of real
values, and $b$ is a real number. Now, a new vector $e$ in $V$ can be classified as positive or
negative depending on the side of $\mathcal{H}$ on which it lies. This is determined by computing
$h(e) := sign(w \bullet e + b)$. If $h(e) > 0$, $e$ belongs to class 1; otherwise, to class 0.

It is possible to compute real numbers $\alpha_1, \ldots, \alpha_n$, such that the classifier $h$ can be
computed through: $h(e) = sign(\sum_i \alpha_i \cdot e_i \bullet e + b)$ (cf. Figure 3).
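
As a minimal sketch of the two ways of evaluating the classifier, the Python code below computes the decision both from $w$ and $b$ and from coefficients $\alpha_i$ over stored vectors $e_i$. The vectors, coefficients and offset are made-up numbers chosen for illustration; no training is performed, and nothing here reflects the Ismion implementation used in ERBlox.

```python
def dot(u, v):
    """Inner product of two real vectors of equal length."""
    return sum(ui * vi for ui, vi in zip(u, v))

def classify_primal(w, b, e):
    """Class 1 if e lies on the positive side of w . x + b = 0, else class 0."""
    return 1 if dot(w, e) + b > 0 else 0

def classify_dual(alphas, vectors, b, e):
    """Same decision, written as sum_i alpha_i * (e_i . e) + b."""
    score = sum(a * dot(ei, e) for a, ei in zip(alphas, vectors)) + b
    return 1 if score > 0 else 0

# Hypothetical stored vectors and coefficients (illustrative values only).
vectors = [(0.9, 0.8), (0.2, 0.1)]
alphas = [1.0, -1.0]
b = -0.9

# The weight vector is the alpha-weighted sum of the stored vectors.
w = tuple(sum(a * ei[k] for a, ei in zip(alphas, vectors)) for k in range(2))

similar_pair = (0.95, 0.9)    # high similarity weights
dissimilar_pair = (0.3, 0.2)  # low similarity weights
labels = [classify_primal(w, b, e) for e in (similar_pair, dissimilar_pair)]
```

Both forms give the same label because the inner product is linear in its first argument.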
Fig. 1. Overview of ERBlox

3 Overview of ERBlox

A high-level description of the components of ERBlox is given in Figure 1. It shows the
workflow supported by ERBlox when doing ER. ERBlox's three main components are:
(1) MD-based collective blocking (path 1, 3, 5, {6, 8}), (2) ML-based record duplicate
detection (the whole initial workflow up to task 13, inclusive), and (3) MD-based merging
(path 14, 15). In the figure, all the boxes in light grey are supported by LogiQL. As
just done, in the rest of this section, numbers in boldface refer to the edges in this figure.

The initial input data is stored in structured text files. (We assume these data are
already standardized and free of misspellings, etc., but duplicates may be present.) Our
general LogiQL program that supports the whole workflow contains some rules for
importing data from the files into the extensions of relational predicates (think of tables;
this is edge 1). This results in a relational database instance $T$ containing the training
data (edge 2), and the instance $D$ on which ER will be performed (edge 3).

The next main task is blocking, which requires similarity computation of pairs of
records in $D$ (edge 5). For record pairs in $T$, similarities have to be computed as well
(edge 4). Similarity computation is based on similarity functions,
$Sf_i : Dom_{A_i} \times Dom_{A_i} \rightarrow [0, 1]$, each of which assigns a numerical value, called similarity
weight, to the comparison of values for a record attribute $A_i$ (from a pre-chosen subset of
attributes) (cf. Figure 2). A weight vector
$w(r_1, r_2) = \langle \cdots, Sf_i(r_1[A_i], r_2[A_i]), \cdots \rangle$ is formed by similarity weights (edge 7).
For more details on similarity computation see Section 4.
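
The construction of a weight vector from per-attribute similarity functions can be sketched as follows. The two similarity functions used here (token-set Jaccard overlap for text, and exact match for years) are stand-ins chosen for brevity; they are assumptions of this sketch and not the functions used in ERBlox, which are described in Section 4. The attribute names are likewise illustrative.

```python
def jaccard(a: str, b: str) -> float:
    """Stand-in text similarity: overlap of whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def same_year(a: int, b: int) -> float:
    """Stand-in numeric similarity: 1.0 on equality, 0.0 otherwise."""
    return 1.0 if a == b else 0.0

# One similarity function Sf_i per chosen attribute A_i.
SIMILARITY = {"title": jaccard, "year": same_year, "keyword": jaccard}

def weight_vector(r1, r2, attrs=("title", "year", "keyword")):
    """w(r1, r2): one similarity weight in [0, 1] per attribute."""
    return tuple(SIMILARITY[a](r1[a], r2[a]) for a in attrs)

p123 = {"title": "Illness entities in West Africa", "year": 1998,
        "keyword": "West Africa, Illness"}
p205 = {"title": "Illness entities in Africa", "year": 1998,
        "keyword": "Africa, Illness"}

w = weight_vector(p123, p205)
```

Each component lies in [0, 1], so the vectors for all candidate pairs live in the same space and can be fed to a single classifier.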

Since some pairs in $T$ are considered to be duplicates and others non-duplicates, the
result of this process leads to a "similarity-enhanced" database $T^s$ of tuples of the form
$\langle r_1, r_2, w(r_1, r_2), L \rangle$, with label $L \in \{0, 1\}$ indicating if the two records are duplicates
($L = 1$) or not ($L = 0$). The labels are consistent with the corresponding weight vectors.
The classifier is trained using $T^s$, leading to a classification model (edges 9, 10).

For records in $D$, similarity measures are needed for blocking, to decide if two
records $r_1, r_2$ go to the same block. Initially, every record has its rid assigned as block
(number). To assign two records to the same block, we use matching dependencies that
specify and enforce (through their RHSs) that their blocks have to be identical. This
happens when certain similarities between pairs of attribute values appearing in the
LHSs of the MDs hold. For this reason, similarity computation is also needed before
blocking (workflow 5, 6, 8). This similarity computation process is similar to the one
for $T$. However, in the case of $D$, this does not lead directly to the same kind of weight
vector computation. Instead, the computation of similarity measures is only for the
similarity predicates appearing in the LHSs of the blocking-MDs. (For example, the
evaluation of the LHS in (1) requires the computation of similarities for dept-string values.)

Notice that these blocking-MDs may capture semantic knowledge, so they could
involve in their LHSs similarities of attribute values in records for different kinds of
entities. For example, in relation to Example 1, there could be similarity comparisons
involving attributes for entities Author and Paper, e.g.

$Author(x_1, y_1, bl_1) \wedge Paper(y_1, z_1, bl_3) \wedge Author(x_2, y_2, bl_2) \wedge$
$Paper(y_2, z_2, bl_4) \wedge x_1 \approx_1 x_2 \wedge z_1 \approx_2 z_2 \rightarrow bl_1 \doteq bl_2,$    (2)

expressing that when the similarities on the LHS hold, the blocks $bl_1, bl_2$ have to be
made identical.^6 The similarity comparison atoms on the LHS are considered to be true
when the similarity values are above predefined thresholds (edges 5, 8).^7

Fig. 2. Feature-based similarity

This is the MD-based collective blocking stage that results in database $D$ enhanced
with information about the blocks to which the records are assigned. Pairs of records
with the same block form candidate duplicate record pairs, and any two records with
different blocks are simply not tested as possible duplicates (of each other).

6 These MDs are more general than those introduced in Section 2.1: they may contain regular
database atoms, which are used to give context to the similarity atoms in the same antecedent.

7 At this point, since all we want is to do blocking, and not yet make decisions about duplicates,
we could, in comparison with what is done with pairs in $T$, compute fewer similarity measures,
and even with lower thresholds.

After the records have been assigned to blocks, pairs of records $\langle r_1, r_2 \rangle$ in the same
block are considered for the duplicate test. At this point we proceed as we did for $T$: the
similarity vectors $w(r_1, r_2)$ have to be computed (edges 11, 12).^8 Next, tuples
$\langle r_1, r_2, w(r_1, r_2) \rangle$ are used as input for the trained classification algorithm (edge 12).

Fig. 3. Classification hyperplane

The result of the trained ML-based classifier, in this case obtained through SVMs
as a separation hyperplane $\mathcal{H}$, is a set $M$ of record pairs $\langle r_1, r_2, 1 \rangle$ that come from the
same block and are considered to be duplicates (edge 13).^9 The records in these pairs
will be merged on the basis of an ad hoc set of MDs (edge 15), different from those
used in edges 6, 8.

Informally, the merge-MDs are of the form: $r_1 \approx r_2 \rightarrow r_1 \doteq r_2$, where the
antecedent is true when $\langle r_1, r_2, 1 \rangle$ is an output of the classifier. The RHS is a shorthand
for: $r_1[A_1] \doteq r_2[A_1] \wedge \cdots \wedge r_1[A_m] \doteq r_2[A_m]$, where $m$ is the total number of record
attributes. Merge at the attribute level uses the matching functions $m_{A_i}$.

We point out that MD-based merging takes care of transitive cases provided by
the classifier, e.g. if it returns $\langle r_1, r_2, 1 \rangle$, $\langle r_2, r_3, 1 \rangle$, but not $\langle r_1, r_3, 1 \rangle$, we still merge
$r_1, r_3$ (even when $r_1 \approx r_3$ does not hold). Actually, we do this by merging all
the records $r_1, r_2, r_3$ into the same record. Our system is capable of recognizing this
situation and solving it as expected. This relies on the way we store and manage, via
our LogiQL program, the positive cases obtained from the classifier (details are given
with the description of the merging component). In essence, this makes our set of
merging-MDs interaction-free, and leads to a unique resolved instance.
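
One way to picture the transitive case is to group the positive pairs returned by the classifier into connected clusters and then merge each cluster attribute by attribute. The Python sketch below does this with a union-find structure. It illustrates the idea only and is not the LogiQL mechanism used by ERBlox; the record ids, names, and the "keep the longer string" matching function are assumptions made for the example.

```python
from functools import reduce

def duplicate_clusters(pairs):
    """Group record ids connected through positive (duplicate) pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for r1, r2 in pairs:
        root1, root2 = find(r1), find(r2)
        if root1 != root2:
            parent[max(root1, root2)] = min(root1, root2)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

def longer(a: str, b: str) -> str:
    """Hypothetical matching function for names (illustration only)."""
    return max(a, b, key=lambda s: (len(s), s))

# Positive classifier outputs: (1, 3) is missing but follows transitively.
positive = [(1, 2), (2, 3), (7, 8)]
names = {1: "M. Roeckl", 2: "Matthias Roeckl", 3: "Matthias Roeckl",
         7: "O. de Sardan", 8: "Olivier de Sardan"}

groups = duplicate_clusters(positive)
merged = {tuple(g): reduce(longer, (names[r] for r in g)) for g in groups}
```

Because the matching function is associative and commutative, the merged value does not depend on the order in which the records of a cluster are combined.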

The following sections provide more details on ERBlox and our approach to ER. 

4 Initial Data and Similarity Computation 

We describe now some aspects of the MAS dataset, highlighting the input and output
of each component of the ERBlox system. The data is represented and provided as
follows. The Author relation contains authors' names and their affiliations. The Paper
relation contains paper titles, years, conference IDs, journal IDs, and keywords. The
PaperAuthor relation contains paper IDs, author IDs, authors' names, and their affiliations.
The Journal and Conference relations contain short names, full names, and home
pages of journals and conferences, respectively. By using ERBlox on this dataset, we
determine which papers in MAS data are written by a given author. This is a clear case of
ER since there are many authors who publish under several variations of their names.
Also the same paper may appear under slightly different titles, etc.^10

8 Similarity computations are kept in appropriate program predicates. So similarity values
computed before blocking can be reused at this stage, or whenever needed.

9 The classifier also returns pairs of records that come from the same block, but are not
considered to be duplicates. The set thereof is not interesting, at least as a workflow component.

Author
AID    Name                     Affiliation                   Bl#
659    Jean-Pierre Olivier de   Ecole des Hautes              659
2546   Olivier de Sardan        Recherche Scientifique        2546
612    Matthias Roeckl          German Aerospace Center       612
4994   Matthias Roeckl          Institute of Communications   4994

Paper
PID   Title                             Year   CID   JID   Keyword                Bl#
123   Illness entities in West Africa   1998   179         West Africa, Illness   123
205   Illness entities in Africa        1998   179         Africa, Illness        205
769   DLR Simulation Environment m3     2007   146         Simulation m3          769
195   DLR Simulation Environment        2007   146         Simulation             195

PaperAuthor
PID   AID    Name                     Affiliation
123   659    Jean-Pierre Olivier de   Ecole des Hautes
205   2546   Olivier de Sardan        Recherche Scientifique
769   612    Matthias Roeckl          German Aerospace Center
195   4994   Matthias Roeckl          Institute of Communications

Fig. 4. Relation extensions from MAS using LogiQL rules

From the MAS dataset, which contains the data in structured files, extensions for
intensional, relational predicates are computed by LogiQL rules of the general program,
e.g.

_file_in(x1, x2, x3) -> string(x1), string(x2), string(x3).    (3)

lang:physical:filePath[_file_in] = "author.csv".    (4)

+author(id1, x2, x3) <- _file_in(x1, x2, x3), string:int64:convert[x1] = id1.    (5)

Here, (3) is a predicate schema declaration (metadata uses "->"), in this case of the
"_file_in" predicate with three string-valued attributes,^11 which is used to store the
contents extracted from the source file, whose path is specified by (4). Derivation rules,
such as (5), use the usual "<-". In this case, it defines the author predicate, and the "+"
in the rule head inserts the data into the predicate extension. The first attribute is made
an identifier. Figure 4 illustrates a small part of the dataset obtained by importing
data into the relational predicates. (There may be missing attribute values.)

As described above, in ERBlox, similarity computation generates similarity weights,
which are used to: (a) compute the weight vectors for the training data $T$ and the data in
$D$ under classification; and (b) do the blocking, where similarity weights are compared
with predefined thresholds for the similarity conditions in the LHSs of blocking-MDs.^12
We used three well-known similarity functions, depending on the attribute domains.
"TF-IDF cosine similarity" [23] was used for computing similarities for text-valued
attributes, whose values are string vectors. It assigns low weights to frequent strings
and high weights to rare strings. It was used for attribute values that contain frequent
strings, such as affiliation. For attributes with short string values, such as author name,
we applied "Jaro-Winkler similarity". Finally, for numerical attributes, such as
publication year, we used "Levenshtein distance", which computes similarity of
two numbers on the basis of the minimum number of operations required to transform
one into the other.

10 For our experiments, we independently used two other datasets: DBLP and Cora Citation.

11 In LogiQL, each predicate has to be declared, unless it can be inferred from the rest of the
program.

12 As described at the end of Section 3, these similarity computations are not used with the MDs
that support the final merging process.

Similarity computation for ERBlox is supported by LogiQL rules that define similarity
functions. In particular, similarity computations are kept in extensions of program
predicates. For example, if the similarity weight of values $a_1, a_2$ for attribute Title is
above the threshold, a tuple $TitleSim(a_1, a_2)$ is created by the program.
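
The step from a distance to thresholded similarity facts can be sketched in Python. The code below implements the standard Levenshtein distance, normalizes it into a weight in [0, 1], and keeps the value pairs whose weight reaches a threshold, in the spirit of the TitleSim tuples above. Applying edit distance to titles and the threshold 0.8 are assumptions of this sketch (the paper uses TF-IDF cosine similarity for text-valued attributes), and the function names are hypothetical.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to transform s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute
        prev = cur
    return prev[-1]

def similarity(s: str, t: str) -> float:
    """Normalize the distance into a similarity weight in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest

def sim_facts(values, threshold):
    """Pairs of distinct values whose similarity reaches the threshold."""
    return {(a, b) for a in values for b in values
            if a < b and similarity(a, b) >= threshold}

titles = ["Illness entities in West Africa", "Illness entities in Africa",
          "DLR Simulation Environment m3", "DLR Simulation Environment"]

title_sim = sim_facts(titles, threshold=0.8)  # hypothetical threshold
```

On these four titles only the two near-identical pairs pass the threshold, matching the TitleSim facts listed later in Example 3.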

5 MD-Based Collective Blocking and Duplicate Detection 

Since every record has an identifier, rid, initially each record uses its rid as its block
number, in an extra attribute Bl#. In this way, we create the initial blocking instance
from the initial instance $D$, also denoted with $D$. Now, blocking strategies are captured
by means of (blocking) MDs of the form:

$R_i(\bar{X}_1, Bl_1) \wedge R_i(\bar{X}_2, Bl_2) \wedge \psi(\bar{X}_3) \rightarrow Bl_1 \doteq Bl_2.$    (6)

Here $Bl_1, Bl_2$ are variables for block numbers, and $R_i$ is a database (record) predicate.
The lists of variables $\bar{X}_1, \bar{X}_2$ stand for all the attributes in $R_i$, but Bl#. Formula $\psi$ is
a conjunction of relational atoms and comparison atoms via similarity predicates; but it
does not contain similarity comparisons of block numbers, such as $Bl_3 \approx Bl_4$.^13 The
variables in the list $\bar{X}_3$ appear in $R_i$ or in another database predicate or in a similarity
atom. It holds that $(\bar{X}_1 \cup \bar{X}_2) \cap \bar{X}_3 \neq \emptyset$. For an example, see (2), where $R_i$ is Author.

In order to enforce these MDs on two records, we use a binary matching function
to make two block numbers identical: $m_{Bl\#}(i, j) := i$ if $j \le i$. More generally, for
the application-dependent set of blocking-MDs we adopt the chase-based semantics
for entity resolution. Since this set of MDs is interaction-free, its enforcement
results in a single instance where now records may share block numbers, in which
case they belong to the same block. Every record is assigned to a single block.

Example 2. These are some of the blocking-MDs used for the MAS dataset:

$Paper(pid_1, x_1, y_1, z_1, w_1, v_1, bl_1) \wedge Paper(pid_2, x_2, y_2, z_2, w_2, v_2, bl_2) \wedge$    (7)
$x_1 \approx_{Title} x_2 \wedge y_1 = y_2 \wedge z_1 = z_2 \rightarrow bl_1 \doteq bl_2.$

$Author(aid_1, x_1, y_1, bl_1) \wedge Author(aid_2, x_2, y_2, bl_2) \wedge$    (8)
$x_1 \approx_{Name} x_2 \wedge y_1 \approx_{Aff} y_2 \rightarrow bl_1 \doteq bl_2.$

$Paper(pid_1, x_1, y_1, z_1, w_1, v_1, bl_1) \wedge Paper(pid_2, x_2, y_2, z_2, w_2, v_2, bl_2) \wedge$    (9)
$PaperAuthor(pid_1, aid_1, x'_1, y'_1) \wedge PaperAuthor(pid_2, aid_2, x'_2, y'_2) \wedge$
$Author(aid_1, x'_1, y'_1, bl_3) \wedge Author(aid_2, x'_2, y'_2, bl_3) \wedge x_1 \approx_{Title} x_2 \rightarrow bl_1 \doteq bl_2.$

$Author(aid_1, x_1, y_1, bl_1) \wedge Author(aid_2, x_2, y_2, bl_2) \wedge x_1 \approx_{Name} x_2 \wedge$    (10)
$PaperAuthor(pid_1, aid_1, x_1, y_1) \wedge PaperAuthor(pid_2, aid_2, x_2, y_2) \wedge$
$Paper(pid_1, x'_1, y'_1, z'_1, w'_1, v'_1, bl_3) \wedge Paper(pid_2, x'_2, y'_2, z'_2, w'_2, v'_2, bl_3) \rightarrow bl_1 \doteq bl_2.$

Informally, (7) tells us that, for every two Paper entities $p_1, p_2$ for which the values
for attribute Title are similar and that have the same publication year and conference ID,
the values for attribute Bl# must be made the same. By (8), whenever there are similar
values for name and affiliation in Author, the corresponding authors should be in the same
block. Furthermore, (9) and (10) collectively block Paper and Author entities. For instance,
(9) states that if two authors are in the same block, their papers $p_1, p_2$ having similar titles
must be in the same block. Notice that if papers $p_1$ and $p_2$ have similar titles, but they
do not have the same publication year or conference ID, we cannot block them together
using (7) alone. ■

13 Actually, this natural condition makes the set of blocking-MDs interaction-free, i.e. for every
two blocking-MDs $m_1, m_2$, the set of attributes on the RHS of $m_1$ and the set of attributes on
the LHS of $m_2$ on which there are similarity predicates, are disjoint.

We now show how these MDs are represented in LogiQL, and how we use LogiQL
programs for the declarative specification of MD-based collective blocking.^14 In LogiQL,
an MD takes the form:

$R_i[\bar{X}_1] = Bl_2,\ R_i[\bar{X}_2] = Bl_2 \ \leftarrow\ R_i[\bar{X}_1] = Bl_1,\ R_i[\bar{X}_2] = Bl_2,\ \psi(\bar{X}_3),\ Bl_1 < Bl_2,$    (11)

subject to the same conditions as in (6). An atom $R_i[\bar{X}] = Bl$ states that predicate $R_i$ is
functional on $\bar{X}$. It means each record in $R_i$ can have only one block number.

Given an initial instance $D$, a LogiQL program $\mathcal{P}^B(D)$ that specifies MD-based
collective blocking contains the following (kind of) rules:

1. For every atom $R(rid, \bar{x}, bl) \in D$, the fact $R[rid, \bar{x}] = bl$. (Initially, $bl := rid$.)

2. For every attribute $A$ of $R_i$, facts of the form $A\text{-}Sim(a_1, a_2)$, with $a_1, a_2 \in Dom_A$,
the finite attribute domain. They are obtained by similarity computation.

3. The blocking-MDs as in (11).

4. Rules to represent the consecutive versions of entities during MD-enforcement:

$R\text{-}OldVersion(r_1, \bar{x}_1, bl_1) \ \leftarrow\ R[r_1, \bar{x}_1] = bl_1,\ R[r_1, \bar{x}_1] = bl_2,\ bl_1 < bl_2.$

For each rid, $r$, there could be several atoms of the form $R[r, \bar{x}] = bl$, corresponding to
the evolution of the record identified by $r$ due to MD-enforcement. The rule specifies
that versions of records with lower block numbers are old.

5. Rules that collect the latest versions of records. They are used to form blocks:

$R\text{-}MDBlock[r_1, \bar{x}_1] = bl_1 \ \leftarrow\ R[r_1, \bar{x}_1] = bl_1,\ !R\text{-}OldVersion(r_1, \bar{x}_1, bl_1).$

In LogiQL, "!", as in the body above, is used for negation. The rule collects $R$-records
that are not old versions.

Programs $\mathcal{P}^B(D)$ as above are stratified (there is no recursion involving negation).
Then, as expected in relation to the blocking-MDs, they have a single model, which can
be used to read the final block number for each record.

Example 3. (ex.|^cont.) Considering only MDs Q and (j^, the portion of V^{D) for 
blocking Paper entities has the following rules: 

2. Facts such aS! TitleSim{Illness entities in West Africa, Illness entities in Africa). 

TitleSim{DLR Simulation Environment m3, DLR Simulation Environment). 

3. The two blocking-MDs: 

$$\begin{aligned}
&Paper[pid_1, x_1, y_1, z_1, w_1, v_1] = bl_2,\; Paper[pid_2, x_2, y_2, z_2, w_2, v_2] = bl_2 \;\leftarrow\\
&\quad Paper[pid_1, x_1, y_1, z_1, w_1, v_1] = bl_1,\; Paper[pid_2, x_2, y_2, z_2, w_2, v_2] = bl_2,\\
&\quad TitleSim(x_1, x_2),\; y_1 = y_2,\; z_1 = z_2,\; bl_1 < bl_2.
\end{aligned}$$

$$\begin{aligned}
&Paper[pid_1, x_1, y_1, z_1, w_1, v_1] = bl_2,\; Paper[pid_2, x_2, y_2, z_2, w_2, v_2] = bl_2 \;\leftarrow\\
&\quad Paper[pid_1, x_1, y_1, z_1, w_1, v_1] = bl_1,\; Paper[pid_2, x_2, y_2, z_2, w_2, v_2] = bl_2,\; TitleSim(x_1, x_2),\\
&\quad PaperAuthor(pid_1, aid_1, x'_1, y'_1),\; PaperAuthor(pid_2, aid_2, x'_2, y'_2),\\
&\quad Author[aid_1, x'_1, y'_1] = bl_3,\; Author[aid_2, x'_2, y'_2] = bl_3,\; bl_1 < bl_2.
\end{aligned}$$

(Footnote: since we have interaction-free sets of blocking-MDs, stratified Datalog programs 
are expressive enough to express and enforce them. LogiQL supports stratified Datalog.) 
4. $PaperOldVersion(pid_1, x_1, y_1, z_1, w_1, v_1, bl_1) \;\leftarrow\; Paper[pid_1, x_1, y_1, z_1, w_1, v_1] = bl_1,$ 

$\quad Paper[pid_1, x_1, y_1, z_1, w_1, v_1] = bl_2,\; bl_1 < bl_2.$ 

5. $PaperMDBlock[pid_1, x_1, y_1, z_1, w_1, v_1] = bl_1 \;\leftarrow\; Paper[pid_1, x_1, y_1, z_1, w_1, v_1] = bl_1,$ 

$\quad !\,PaperOldVersion(pid_1, x_1, y_1, z_1, w_1, v_1, bl_1).$ 

Restricting the model of the program to the relevant attributes of predicate PaperMDBlock 
returns: {{123, 205}, {195, 769}}, i.e. the papers with pids 123 and 205 are 
blocked together, and similarly for those with pids 195 and 769. ■ 
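The similarity facts of item 2 are said to be obtained by similarity computation, without the measure being fixed at this point. As a hedged illustration only, the sketch below derives such facts with a token-set Jaccard measure and a threshold of 0.7; both the measure and the threshold are assumptions, as are the function names.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def sim_facts(values, threshold=0.7):
    """Return the symmetric set of pairs whose similarity reaches the threshold."""
    facts = set()
    for a, b in combinations(sorted(set(values)), 2):
        if jaccard(a, b) >= threshold:
            facts.add((a, b))
            facts.add((b, a))
    return facts

# The four titles mentioned in the example.
titles = [
    "Illness entities in West Africa",
    "Illness entities in Africa",
    "DLR Simulation Environment m3",
    "DLR Simulation Environment",
]
title_sim = sim_facts(titles)
```

With these assumptions the two pairs listed in item 2 are produced (in both directions), and no cross pair between the two topics qualifies.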

As described above, the input to the trained classifier is a set of tuples of the form 
$\langle r_1, r_2, w(r_1, r_2) \rangle$, with $w(r_1, r_2)$ the computed weight vector for records (with ids) 
$r_1, r_2$ in the same block. 

Example 4. (cont.) Consider the blocks for entity Paper. If the “journal ID” values 
are null in both records, but the “conference ID” values are not, “journal ID” is not 
considered as a feature; similarly when the “conference ID” values are null. Moreover, 
the values for “journal ID” and “conference ID” are replaced by “journal full name” 
and “conference full name” values, found in Journal and Conference records, resp. In 
this case, attributes Title, Year, ConfFullName or JourFullName, and Keyword are 
used as the features for weight vector computation. 

Considering the previous Paper records, the input to the classifier consists of: ⟨123, 
205, w(123, 205)⟩, with w(123, 205) = [0.8, 1.0, 1.0, 0.7], and ⟨195, 769, w(195, 769)⟩, 
with w(195, 769) = [0.93, 1.0, 1.0, 0.5] (actually the contents of the two square brackets 
only). ■ 
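The handling of null-valued features in the example can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the per-attribute similarity functions (token-set Jaccard for strings, equality for years), the dictionary record layout, and the two records `p1` and `p2` are hypothetical and are not the records behind the weight vectors quoted above.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity (assumed string similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def same(a, b):
    """Exact-match similarity, used here for the Year attribute."""
    return 1.0 if a == b else 0.0

# Candidate features in a fixed order; names follow the example.
FEATURES = [
    ("Title", jaccard),
    ("Year", same),
    ("ConfFullName", jaccard),
    ("JourFullName", jaccard),
    ("Keyword", jaccard),
]

def weight_vector(rec1, rec2, features=FEATURES):
    """One similarity per feature, skipping features that are null (None)
    in either record, since they carry no discrimination power."""
    vector = []
    for name, sim in features:
        a, b = rec1.get(name), rec2.get(name)
        if a is None or b is None:
            continue
        vector.append(sim(a, b))
    return vector

# Hypothetical records: both are conference papers, so JourFullName is null.
p1 = {"Title": "Illness entities in West Africa", "Year": 1990,
      "ConfFullName": "Tropical Medicine Conference", "JourFullName": None,
      "Keyword": "malaria epidemiology"}
p2 = {"Title": "Illness entities in Africa", "Year": 1990,
      "ConfFullName": "Tropical Medicine Conference", "JourFullName": None,
      "Keyword": "malaria"}
w = weight_vector(p1, p2)
```

The resulting vector has four components (Title, Year, ConfFullName, Keyword), matching the shape of the vectors in the example.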

Several ML techniques are accessible from the LogicBlox platform through the BloxMLPack 
library, which provides a generic Datalog interface. ERBlox can thus call an ML-based 
record duplicate detection component from the general LogiQL program. In 
this way, the SVMs package is invoked by ERBlox. 

The output is a set of tuples of the form $\langle r_1, r_2, 1 \rangle$ or $\langle r_1, r_2, 0 \rangle$, where $r_1, r_2$ are ids 
for records of entity (table) $R$. In the former case, a tuple $R\text{-}Duplicate(r_1, r_2)$ is created 
(as defined by the LogiQL program). In the previous example, the SVMs method returns 
⟨[0.8, 1.0, 1.0, 0.7], 1⟩ and ⟨[0.93, 1.0, 1.0, 0.5], 1⟩, so PaperDuplicate(123, 205) and 
PaperDuplicate(195, 769) are created. 
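The step from classifier output to duplicate facts can be sketched as below. The classifier here is a deliberately trivial stand-in (a threshold on the mean weight), used only so the fragment runs on its own; it is an assumption and not the trained SVM, and the function names are hypothetical.

```python
def stub_classifier(weights):
    """Stand-in for the trained classifier (assumption): label 1 when the
    mean similarity weight is at least 0.5, else 0."""
    return 1 if sum(weights) / len(weights) >= 0.5 else 0

def duplicate_facts(candidates, classify):
    """candidates: iterable of (r1, r2, weight_vector) for records in a block.
    Returns the set of (r1, r2) pairs labelled 1, i.e. the Duplicate facts."""
    return {(r1, r2) for r1, r2, w in candidates if classify(w) == 1}

# Classifier input from the running example.
candidates = [
    (123, 205, [0.8, 1.0, 1.0, 0.7]),
    (195, 769, [0.93, 1.0, 1.0, 0.5]),
]
paper_duplicate = duplicate_facts(candidates, stub_classifier)
```

In an actual pipeline, `classify` would wrap the prediction call of the trained model; pairs labelled 0 simply produce no fact.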

6 MD-Based Merging 

When $EntityDuplicate(r_1, r_2)$ is created, the corresponding full records $\bar{r}_1, \bar{r}_2$ have to 
be merged via record-level merge-MDs of the form $R[r_1] \approx R[r_2] \rightarrow R[\bar{r}_1] \doteq R[\bar{r}_2]$, 
where $R[r_1] \approx R[r_2]$ is true when $R\text{-}Duplicate(r_1, r_2)$ has been created according 
to the output of the SVMs classifier. The RHS means that the two records are 
merged into a new full record $\bar{r}$, with $\bar{r}[A_i] := m_{A_i}(\bar{r}_1[A_i], \bar{r}_2[A_i])$. 

Example 5. (cont.) We merge duplicate Paper entities by enforcing the MD: 
$Paper[pid_1] \approx Paper[pid_2] \rightarrow Paper[Title, Year, CID, Keyword] \doteq Paper[Title, Year, CID, Keyword]$. ■ 

(Footnote: the features considered in a weight vector computation depend on whether they have a strong 
discrimination power, i.e. do not contain missing values.) 
The portion, $\mathcal{P}^M$, of the general LogiQL program that represents MD-based merging 
contains rules as in 1.-4. below: 

1. The atoms of the form $R\text{-}Duplicate$ mentioned above, and those representing the 
matching functions (MFs) $m_A$. 

2. For an MD $R[r_1] \approx R[r_2] \rightarrow R[\bar{r}_1] \doteq R[\bar{r}_2]$, the rule: 

$$R[r_1, \bar{x}_3] = bl,\; R[r_2, \bar{x}_3] = bl \;\leftarrow\; R\text{-}Duplicate(r_1, r_2),\; R[r_1, \bar{x}_1] = bl,\; R[r_2, \bar{x}_2] = bl,\; m(\bar{x}_1, \bar{x}_2) = \bar{x}_3,$$

which creates two records (one of them can be purged afterwards) with different ids but 
with all the other attribute values the same, computed componentwise according to the 
MFs in $m$. Here, $\bar{x}_1, \bar{x}_2, \bar{x}_3$ each stand for all attributes of relation $R$, except for the id 
and the block number (represented by $bl$). (Block numbers play no role in merging.) 

3. As for the program $\mathcal{P}^B(D)$ given earlier, rules specify the old versions of a record: 

$$R\text{-}OldVersion(r_1, \bar{x}_1) \;\leftarrow\; R[r_1, \bar{x}_1] = bl,\; R[r_1, \bar{x}_2] = bl,\; \bar{x}_1 \prec \bar{x}_2.$$

Here, $\bar{x}_1$ stands for all attributes other than the id and the block number; and on the RHS 
$\bar{x}_1 \prec \bar{x}_2$ means componentwise comparison of values according to the partial orders 
defined by the MFs. 

4. Finally, rules to collect the latest version of each record, building the final resolved 
instance: $R\text{-}ER(r_1, \bar{x}_1) \;\leftarrow\; R[r_1, \bar{x}_1] = bl,\; !\,R\text{-}OldVersion(r_1, \bar{x}_1).$ 

Notice that the derived tables $R\text{-}Duplicate$ that appear in the LHSs of the MDs 
(or in the bodies of the corresponding rules) are all computed before (and kept fixed 
during) the enforcement of the merge-MDs. In particular, a duplicate relationship between 
any two records is not lost. This has the effect of making the set of merge-MDs 
interaction-free, which results in a unique resolved instance. 
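The merge rules 2-4 can be pictured with a small procedural sketch. It is only an illustration under assumptions: the two matching functions (`longer` for strings, `max` for years) are hypothetical choices that are idempotent, commutative and associative, and the records with ids 1-3 are made up for the example.

```python
def longer(a, b):
    """Assumed matching function for strings: keep the longer value,
    breaking ties lexicographically so that the result is order-independent."""
    return max(a, b, key=lambda s: (len(s), s))

def merge_duplicates(records, duplicates, mfs):
    """records: id -> tuple of attribute values (id and block number excluded)
    duplicates: fixed set of (r1, r2) pairs, computed before merging
    mfs: one matching function per attribute position
    """
    current = dict(records)
    changed = True
    while changed:
        changed = False
        for r1, r2 in duplicates:
            # Rule 2: both records take the componentwise merged values.
            merged = tuple(m(a, b) for m, a, b in zip(mfs, current[r1], current[r2]))
            if current[r1] != merged or current[r2] != merged:
                # Rules 3-4: overwriting discards the old version and keeps
                # only the latest one for the resolved instance.
                current[r1] = current[r2] = merged
                changed = True
    return current

# Hypothetical (Title, Year) records; the duplicate set is kept fixed.
records = {
    1: ("Illness entities in West Africa", 1990),
    2: ("Illness entities in Africa", 1990),
    3: ("DLR Simulation Environment", 2001),
}
resolved = merge_duplicates(records, {(1, 2)}, (longer, max))
```

After the loop, records 1 and 2 carry identical attribute values and differ only in their ids, so one of them can be purged; record 3 is untouched.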

7 Experimental Evaluation 

We now show that our approach to ER can improve accuracy in comparison with standard 
blocking. In addition to MAS, we used datasets from DBLP and Cora Citation. 

In order to emphasize the importance of semantic knowledge in blocking, we 
consider standard blocking and two different sets of MDs, (1) and (2), for MD-based 
collective blocking. Under (1), we define blocking-MDs for all the blocking 
keys used for standard blocking, but under (2) we have MDs for only some of 
the blocking keys used. In both cases, these come in addition to properly collective 
blocking-MDs. 

Fig. 5. The experiments (MAS): reduction ratio, recall, and precision. 

We use three measures for the comparison of blocking techniques. One is the reduction 
ratio, which is 1 minus the ratio of the number of candidate record-pairs to the number 
of all record pairs in the initial instance. The higher this value, the fewer candidate 
record-pairs are generated, but the quality of the generated candidate record-pairs is 
not taken into account. We also use the recall and precision measures. Recall is the 
number of true duplicate candidate record-pairs divided by the number of true duplicate 
pairs, and precision is the number of true duplicate candidate record-pairs divided by 
the total number of candidate pairs. 
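For concreteness, the three measures can be computed from a set of blocks and a set of known duplicate pairs as sketched below. This is an illustration with made-up data; it assumes that the reduction ratio is taken against the number of all possible record pairs, which is the usual convention for this measure, and that both the candidate set and the truth set are non-empty.

```python
from itertools import combinations

def candidate_pairs(blocks):
    """All unordered record pairs that share a block."""
    pairs = set()
    for block in blocks:
        pairs.update(frozenset(p) for p in combinations(sorted(block), 2))
    return pairs

def blocking_measures(blocks, true_duplicates, n_records):
    """Reduction ratio, recall and precision of a blocking result.

    Assumes at least one candidate pair and one true duplicate pair.
    """
    candidates = candidate_pairs(blocks)
    truth = {frozenset(p) for p in true_duplicates}
    total_pairs = n_records * (n_records - 1) // 2
    hits = len(candidates & truth)
    return {
        "reduction_ratio": 1 - len(candidates) / total_pairs,
        "recall": hits / len(truth),
        "precision": hits / len(candidates),
    }

# Hypothetical toy instance with six records.
blocks = [{1, 2, 3}, {4, 5}, {6}]
truth = [(1, 2), (4, 5), (3, 6)]
m = blocking_measures(blocks, truth, n_records=6)
```

Here 4 of the 15 possible pairs are candidates, two of the three true duplicate pairs are covered, and half of the candidates are true duplicates.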

Figures 5, 6, and 7 show the comparative performance of ERBlox. They show that 
standard blocking has a higher reduction ratio than MD-based collective blocking version 
(1). This means that fewer candidate record-pairs are generated by standard blocking. 
However, the precision and recall of MD-based blocking version (1) are higher than 
those of standard blocking, meaning that MD-based blocking version (1) can lead to improved 
ER results at the cost of larger blocks, and thus more candidate record-pairs that need 
to be compared. 

In blocking, this is a common trade-off that needs to be considered. On the 
one hand, having a large number of smaller blocks results in fewer candidate 
record-pairs being generated, probably increasing the number of true 
duplicate record-pairs that are missed. On the other hand, blocking techniques 
that result in larger blocks generate a higher number of candidate record-pairs 
that will likely cover more true duplicate pairs, at the cost of having to compare 
more candidate pairs. The experiments are all done before MD-based merging. 
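The size of this trade-off can be made explicit with a small worked computation. It assumes disjoint blocks, and the instance of 100 records is hypothetical, chosen only for illustration and not taken from the experiments.

```latex
\[
  \#\mathrm{pairs}(B_1,\dots,B_k) \;=\; \sum_{i=1}^{k} \binom{|B_i|}{2}
\]
\[
  \text{10 blocks of size 10: } 10\cdot\binom{10}{2} = 450,
  \qquad
  \text{2 blocks of size 50: } 2\cdot\binom{50}{2} = 2450,
  \qquad
  \text{no blocking: } \binom{100}{2} = 4950 .
\]
```

Since the count grows quadratically in the block size, a few large blocks cost far more comparisons than many small ones, which is why larger blocks can only pay off through better recall.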

Interestingly, MD-based blocking version (2) has a higher reduction ratio, 
recall, and precision than standard blocking. This emphasizes the importance of 
MDs supporting collective blocking, and shows that blocking based on string 
similarity alone fails to capture the relationships that naturally hold in the data. 

As expected, the experiments show that different sets of MDs for MD-based 
collective blocking have different impacts on the reduction ratio, just as standard blocking 
depends on the choice of blocking keys. However, the quality of MD-based collective 
blocking, in its two versions, dominates that of standard blocking for the three datasets. 

8 Conclusions 

We have shown that matching dependencies, a new class of data quality/cleaning 
semantic constraints in databases, can be profitably integrated with traditional ML methods, 
in our case for entity resolution. They play a role not only in the intended goal of 
merging duplicate representations, but also in the record blocking process that precedes the 
learning task. At that stage they make it possible to declaratively capture semantic information 
that can be used to enrich the blocking activity. MD declaration and enforcement, data 
processing in general, and machine learning can all be integrated using the LogiQL 
language. 

Fig. 6. The experiments (DBLP) 

Fig. 7. The experiments (Cora) 